Statistical Learning Methods for High Dimensional Genomic Data

Author

  • Salvatore Masecchia
Abstract

Because -omics technologies produce high-dimensional data, they require computational methods that can handle a large number of variables. Each data type is characterized by its method of measurement and by the biological aspect under study. Understanding the properties of the data allows the design of sophisticated and effective computational models that can uncover and explain complex biological phenomena. This thesis explores the use of statistical learning methods on different kinds of high-throughput molecular data in order to answer heterogeneous biological questions related to various diseases. We address problems at different biological levels (e.g. gene expression or genomic alteration), exploiting the distinct peculiarities of the data under analysis. We propose a computational framework in which a biological question is modeled as the solution of a minimization problem for a functional in which data properties are described via a composition of penalties and constraints. This framework includes a wide range of regularized least squares and regularized matrix factorization methods. We focus on two main questions. First, we apply ℓ1ℓ2-norm regularization to extract gene signatures from gene expression data related to neurodegenerative diseases such as Alzheimer's and Parkinson's. This feature selection method is nested in a pipeline in which functionally related pathways are extracted from the list of relevant genes. The last step of the pipeline is designed to infer, from the data, the interaction network related to each pathway, in order to evaluate differences between phenotypes (e.g. patients vs. controls). Second, working with aCGH data in the context of Dictionary Learning, we combine a set of penalties (e.g. ℓ1-norm and Total Variation) and hard constraints to automatically detect common genomic alterations in a set of high-risk Neuroblastoma patients. Genomic alterations identified by the regularized method are used as input to an algorithm for oncogenesis tree estimation. Finally, we present a set of well-structured software modules, tools and libraries that implement the above methods and models.
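
The ℓ1ℓ2-regularized least squares problem mentioned in the abstract can be illustrated with a minimal sketch. The code below is not the thesis's implementation: the function names, the 1/n scaling of the data term, the parameter names `tau` (ℓ1 weight) and `mu` (ℓ2 weight), and the choice of a plain proximal gradient solver are all assumptions made for illustration. It minimizes (1/n)‖y − Xβ‖² + μ‖β‖² + τ‖β‖₁, where the ℓ1 term drives coefficients to exactly zero and so performs feature (gene) selection.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def l1l2_least_squares(X, y, tau, mu, n_iter=500):
    """Illustrative solver (hypothetical name) for
        (1/n) * ||y - X b||^2 + mu * ||b||_2^2 + tau * ||b||_1
    using proximal gradient descent with a fixed step 1/L.
    """
    n, p = X.shape
    # Lipschitz constant of the gradient of the smooth part
    # (squared loss plus the l2 penalty).
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / n + 2.0 * mu
    beta = np.zeros(p)
    for _ in range(n_iter):
        # Gradient of the smooth part at the current iterate.
        grad = (2.0 / n) * X.T @ (X @ beta - y) + 2.0 * mu * beta
        # Gradient step followed by the l1 proximal step.
        beta = soft_threshold(beta - grad / L, tau / L)
    return beta


if __name__ == "__main__":
    # Synthetic "many variables, few samples" data: 40 samples, 100 variables,
    # of which only the first 5 influence the response.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((40, 100))
    true_beta = np.zeros(100)
    true_beta[:5] = 3.0
    y = X @ true_beta + 0.1 * rng.standard_normal(40)

    beta = l1l2_least_squares(X, y, tau=0.5, mu=0.01)
    print("selected variables:", np.flatnonzero(beta))
```

In this sketch, variables with a nonzero coefficient play the role of the gene signature; larger values of `tau` select fewer variables, while `mu` stabilizes the solution when variables are correlated.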

Similar resources

Statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic study for complex diseases

Recent advances of information technology in biomedical sciences and other applied areas have created numerous large diverse data sets with a high dimensional feature space, which provide us a tremendous amount of information and new opportunities for improving the quality of human life. Meanwhile, great challenges are also created driven by the continuous arrival of new data that requires rese...

Robust high-dimensional semiparametric regression using optimized differencing method applied to the vitamin B2 production data

Background and purpose: By evolving science, knowledge, and technology, we deal with high-dimensional data in which the number of predictors may considerably exceed the sample size. The main problems with high-dimensional data are the estimation of the coefficients and interpretation. For high-dimension problems, classical methods are not reliable because of a large number of predictor variable...

High-Dimensional Unsupervised Active Learning Method

In this work, a hierarchical ensemble of projected clustering algorithm for high-dimensional data is proposed. The basic concept of the algorithm is based on the active learning method (ALM) which is a fuzzy learning scheme, inspired by some behavioral features of human brain functionality. High-dimensional unsupervised active learning method (HUALM) is a clustering algorithm which blurs the da...

Learning with Sparsity: Structures, Optimization and Applications

The development of modern information technology has enabled collecting data of unprecedented size and complexity. Examples include web text data, microarray & proteomics, and data from scientific domains (e.g., meteorology). To learn from these high dimensional and complex data, traditional machine learning techniques often suffer from the curse of dimensionality and unaffordable computational...

Publication date: 2013